Introduction to the Table Scan Workflow (VALID-V)

The VALID-V: Table Scan workflow consists of a single function: scan_data(). Simple as it is, it gives you a great deal of information about a data table. The function generates an HTML report that scours the input table data. It is worth running before diving into the other workflows, because it’s a good idea to first understand the target table with some level of precision.

The report is divided into several sections to make everything more digestible:

  • Overview: Shows table dimensions, duplicate row counts, column types, and reproducibility information
  • Variables: Provides a summary for each table variable and further statistics and summaries depending on the variable type
  • Interactions: Displays a matrix plot that describes the interactions between variables
  • Correlations: Shows a set of correlation matrix plots for numerical variables
  • Missing Values: A summary figure that shows the degree of missingness across variables
  • Sample: A table that provides the head and tail rows of the dataset
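
To make the Overview and Variables entries more concrete, here is a rough base R sketch that computes comparable figures for a small, made-up data frame. The tbl data, the overview list, and the summarise_var() helper are illustrative assumptions; this is not how scan_data() works internally, only a picture of the kind of information the report summarizes.

```r
# Small, made-up table used only for illustration
tbl <- data.frame(
  id    = c(1, 2, 2, 4),
  group = c("a", "b", "b", NA),
  value = c(1.5, NA, NA, 4.0),
  stringsAsFactors = FALSE
)

# Overview-style figures: dimensions, NA count, duplicate rows, column types
n_cells <- nrow(tbl) * ncol(tbl)
n_na    <- sum(is.na(tbl))
overview <- list(
  columns        = ncol(tbl),
  rows           = nrow(tbl),
  nas            = n_na,
  nas_pct        = round(100 * n_na / n_cells, 2),
  duplicate_rows = sum(duplicated(tbl)),
  column_types   = table(vapply(tbl, function(x) class(x)[1], character(1)))
)

# Variables-style summary for one column (hypothetical helper).
# NA is counted as one distinct value here.
summarise_var <- function(x) {
  out <- list(
    distinct = length(unique(x)),
    nas      = sum(is.na(x)),
    inf      = if (is.numeric(x)) sum(is.infinite(x)) else 0L
  )
  if (is.numeric(x)) {
    out$mean <- mean(x, na.rm = TRUE)
    out$min  <- min(x, na.rm = TRUE)
    out$max  <- max(x, na.rm = TRUE)
  }
  out
}

variables <- lapply(tbl, summarise_var)
str(overview)
str(variables)
```

Numeric columns get the extra mean, minimum, and maximum entries, which mirrors how the Variables section shows different statistics depending on the variable type.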

An Example with the Palmer Penguins

The output HTML report will appear in the RStudio Viewer and can also be integrated in R Markdown HTML output. Here’s an example that uses the penguins_raw dataset from the palmerpenguins package. In the scan_data() call, the navigation bar is turned off with navbar = FALSE, which makes sense when integrating this type of output in a larger document.

scan_data(palmerpenguins::penguins_raw, navbar = FALSE)

Overview of palmerpenguins::penguins_raw

Table Overview

  • Columns: 17
  • Rows: 344
  • NAs: 336 (5.75%)
  • Duplicate Rows: 0

Column Types

  • character: 9
  • numeric: 7
  • Date: 1

Reproducibility Information

  • Scan Build Time: 2020-11-09 23:06:43
  • pointblank Version: 0.5.2.9000
  • R Version: R version 4.0.3 (2020-10-10), Bunny-Wunnies Freak Out
  • Operating System: x86_64-apple-darwin17.0

Variables

Variable               Distinct        NAs            Inf/-Inf   Mean        Minimum   Maximum
studyName              3 (0.87%)       0              0
Sample Number          152 (44.19%)    0              0          63.15       1         152
Species                3 (0.87%)       0              0
Region                 1 (0.29%)       0              0
Island                 3 (0.87%)       0              0
Stage                  1 (0.29%)       0              0
Individual ID          190 (55.23%)    0              0
Clutch Completion      2 (0.58%)       0              0
Date Egg               50 (14.53%)     0              0
Culmen Length (mm)     165 (47.97%)    2 (0.58%)      0          43.92       32.1      59.6
Culmen Depth (mm)      81 (23.55%)     2 (0.58%)      0          17.15       13.1      21.5
Flipper Length (mm)    56 (16.28%)     2 (0.58%)      0          200.92      172       231
Body Mass (g)          95 (27.62%)     2 (0.58%)      0          4,201.75    2,700     6,300
Sex                    3 (0.87%)       11 (3.20%)     0
Delta 15 N (o/oo)      331 (96.22%)    14 (4.07%)     0          8.73        7.63      10.03
Delta 13 C (o/oo)      332 (96.51%)    13 (3.78%)     0          -25.69      -27.02    -23.79
Comments               11 (3.20%)      290 (84.30%)   0
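Interactions

[Figure: matrix plot of the interactions between variables]

Correlations

[Figure: correlation matrix plots for numerical variables]

Missing Values

[Figure: summary of the degree of missingness across variables]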

Sample

  | studyName | Sample Number | Species | Region | Island | Stage | Individual ID | Clutch Completion | Date Egg | Culmen Length (mm) | Culmen Depth (mm) | Flipper Length (mm) | Body Mass (g) | Sex | Delta 15 N (o/oo) | Delta 13 C (o/oo) | Comments
1 | PAL0708 | 1 | Adelie Penguin (Pygoscelis adeliae) | Anvers | Torgersen | Adult, 1 Egg Stage | N1A1 | Yes | 2007-11-11 | 39.1 | 18.7 | 181 | 3750 | MALE | NA | NA | Not enough blood for isotopes.
2 | PAL0708 | 2 | Adelie Penguin (Pygoscelis adeliae) | Anvers | Torgersen | Adult, 1 Egg Stage | N1A2 | Yes | 2007-11-11 | 39.5 | 17.4 | 186 | 3800 | FEMALE | 8.94956 | -24.69454 | NA
3 | PAL0708 | 3 | Adelie Penguin (Pygoscelis adeliae) | Anvers | Torgersen | Adult, 1 Egg Stage | N2A1 | Yes | 2007-11-16 | 40.3 | 18.0 | 195 | 3250 | FEMALE | 8.36821 | -25.33302 | NA
4 | PAL0708 | 4 | Adelie Penguin (Pygoscelis adeliae) | Anvers | Torgersen | Adult, 1 Egg Stage | N2A2 | Yes | 2007-11-16 | NA | NA | NA | NA | NA | NA | NA | Adult not sampled.
5 | PAL0708 | 5 | Adelie Penguin (Pygoscelis adeliae) | Anvers | Torgersen | Adult, 1 Egg Stage | N3A1 | Yes | 2007-11-16 | 36.7 | 19.3 | 193 | 3450 | FEMALE | 8.76651 | -25.32426 | NA
6..339
340 | PAL0910 | 64 | Chinstrap penguin (Pygoscelis antarctica) | Anvers | Dream | Adult, 1 Egg Stage | N98A2 | Yes | 2009-11-19 | 55.8 | 19.8 | 207 | 4000 | MALE | 9.70465 | -24.53494 | NA
341 | PAL0910 | 65 | Chinstrap penguin (Pygoscelis antarctica) | Anvers | Dream | Adult, 1 Egg Stage | N99A1 | No | 2009-11-21 | 43.5 | 18.1 | 202 | 3400 | FEMALE | 9.37608 | -24.40753 | Nest never observed with full clutch.
342 | PAL0910 | 66 | Chinstrap penguin (Pygoscelis antarctica) | Anvers | Dream | Adult, 1 Egg Stage | N99A2 | No | 2009-11-21 | 49.6 | 18.2 | 193 | 3775 | MALE | 9.46180 | -24.70615 | Nest never observed with full clutch.
343 | PAL0910 | 67 | Chinstrap penguin (Pygoscelis antarctica) | Anvers | Dream | Adult, 1 Egg Stage | N100A1 | Yes | 2009-11-21 | 50.8 | 19.0 | 210 | 4100 | MALE | 9.98044 | -24.68741 | NA
344 | PAL0910 | 68 | Chinstrap penguin (Pygoscelis antarctica) | Anvers | Dream | Adult, 1 Egg Stage | N100A2 | Yes | 2009-11-21 | 50.2 | 18.7 | 198 | 3775 | FEMALE | 9.39305 | -24.25255 | NA

As can be seen, the first two sections have a lot of additional information tucked behind detail views (opened with the Toggle details buttons) and within tab sets. Should this amount of information be a little overwhelming, there is the option to disable one or more sections. With scan_data()’s sections argument, you can specify just the sections that are needed for a specific scan.

The default value for sections is the string "OVICMS". Each letter stands for one of the following sections, listed here in their default order:

  • "O": "overview"
  • "V": "variables"
  • "I": "interactions"
  • "C": "correlations"
  • "M": "missing"
  • "S": "sample"

This string can contain fewer key characters, and their order can be changed to suit the desired layout of the report. For example, if you just need the Overview, a Sample, and the description of Variables in the target table, in that order, the string to use for sections would be "OSV".
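
As a minimal sketch of how such a string maps to an ordered set of sections, the snippet below expands "OSV" into section names and then passes the same string to scan_data(). The section_keys vector and the decode_sections() helper are hypothetical names introduced only for illustration; they are not part of pointblank. The scan itself only runs in an interactive session with both packages installed.

```r
# Mapping of key characters to section names, as listed above
section_keys <- c(
  O = "overview", V = "variables", I = "interactions",
  C = "correlations", M = "missing", S = "sample"
)

# Hypothetical helper: expand a `sections` string into ordered section names
decode_sections <- function(sections) {
  keys <- strsplit(sections, "")[[1]]
  stopifnot(all(keys %in% names(section_keys)))
  unname(section_keys[keys])
}

decode_sections("OSV")

# Run the scan only in an interactive session where both packages are installed
if (interactive() &&
    requireNamespace("pointblank", quietly = TRUE) &&
    requireNamespace("palmerpenguins", quietly = TRUE)) {
  report <- pointblank::scan_data(
    palmerpenguins::penguins_raw,
    sections = "OSV"
  )
  print(report)
}
```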

Just as with all the other workflows, the tbl supplied can be a data frame, a tibble, a tbl_dbi object, or a tbl_spark object. However, there is one limitation here for scan_data(): for tbl_dbi and tbl_spark objects, the Interactions and Correlations sections are currently excluded.
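
The following sketch shows what scanning a database table might look like, assuming an in-memory SQLite database and that the DBI, RSQLite, dplyr, and dbplyr packages are installed. The "demo" table and its contents are made up for illustration. Because the Interactions and Correlations sections do not apply here, the example requests only the remaining sections explicitly.

```r
# Drop the Interactions ("I") and Correlations ("C") keys from the default
db_sections <- gsub("[IC]", "", "OVICMS")

pkgs <- c("pointblank", "DBI", "RSQLite", "dplyr", "dbplyr")
have_pkgs <- all(vapply(pkgs, requireNamespace, logical(1), quietly = TRUE))

# Run only in an interactive session where all packages are installed
if (interactive() && have_pkgs) {
  con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")

  # Small made-up table copied into the in-memory database
  demo_df <- data.frame(id = 1:5, value = c(2.5, NA, 1, 4, 3))
  dplyr::copy_to(con, demo_df, name = "demo")

  # dplyr::tbl() on a DBI connection returns a tbl_dbi object
  db_tbl <- dplyr::tbl(con, "demo")

  report <- pointblank::scan_data(db_tbl, sections = db_sections)
  print(report)

  DBI::dbDisconnect(con)
}
```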

Languages and Locales

The reporting generated by scan_data() can be presented in one of eight spoken languages: English ("en", the default), French ("fr"), German ("de"), Italian ("it"), Spanish ("es"), Portuguese ("pt"), Chinese ("zh"), and Russian ("ru"). These two-letter language codes can be passed to the lang argument. When applied, all label text and other non-data elements will be set to the language of choice. We have checked the translations with native speakers of the respective languages, but if you find an error that should be corrected, please file an issue.

Along with translations, numerical values that are generated as part of the reporting (e.g., table dimensions, summary statistics, etc.) are automatically formatted according to the locale of the language given in lang. This can be overridden with the locale argument, which accepts a locale ID. Examples include "en_US" for English (United States) and "fr_FR" for French (France). More simply, this can be a language identifier without a country designation, like "es" for Spanish (Spain, same as "es_ES"). More than 700 locales are currently accepted.
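
A short sketch of how lang and locale might be combined is given below. It uses only the arguments described above; the report_langs vector and the choice of the "OV" sections are assumptions made to keep the example small. The scans only run in an interactive session with both packages installed.

```r
# Language codes listed above (default first)
report_langs <- c("en", "fr", "de", "it", "es", "pt", "zh", "ru")

if (interactive() &&
    requireNamespace("pointblank", quietly = TRUE) &&
    requireNamespace("palmerpenguins", quietly = TRUE)) {

  # French labels; numbers follow the locale implied by lang
  report_fr <- pointblank::scan_data(
    palmerpenguins::penguins_raw,
    sections = "OV",
    lang = "fr"
  )

  # German labels, with number formatting overridden to English (United States)
  report_de <- pointblank::scan_data(
    palmerpenguins::penguins_raw,
    sections = "OV",
    lang = "de",
    locale = "en_US"
  )

  print(report_fr)
  print(report_de)
}
```

Comparing the two reports side by side is a quick way to see which elements change with lang (labels) and which change with locale (number formatting).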